Imagine that we are some fancy data scientists exploring - once again - the Gapminder data. We are particularly interested in the development of the GDP across time and across countries. Some R-fanatics from GESIS suggested that we use this tidyverse thing to complete our tasks. They also told us that we do not always need to load all of its packages at once.

1

Load all packages from the tidyverse that are needed for importing Excel data and data wrangling.
You can find the names of the required packages in the slides and exercises for sessions A3, (Tidy Data), A4 (Importing Data), and A5 (Data Wrangling 1).
library(readxl)
library(dplyr)
library(tidyr)

Ok, that wasn’t too hard. But data science is about data, so we have to load the data we are interested in.

2

Import the GDP data from the Excel file with the Gapminder GDP data as gap_gdp. Make sure to only import the Excel sheet named “Data”.
Individual sheets can be chosen by using the argument sheet = "name_of_your_sheet"
gap_gdp <- 
  readxl::read_excel(
    path = "../data/gapminder/GDPpercapitaconstant2000US.xlsx",
    sheet = "Data"
    )

# OR: alternatively you could also import the Gapminder GDP data from the csv file
# gap_gdp <-
#   readr::read_csv("../data/gapminder/gdppercapita_us_inflation_adjusted.csv")

Have the data been successfully imported? They should be in a tibble with the dimensions 275 x 53. As a further check: The income per person for Algeria for the years 1960, 1961, and 1962 should be 1280, 1085, and 856.

3

Check that the income per person for Algeria for the years 1960, 1961, and 1962 is 1280, 1085, and 856.
Algeria is in the 5th row of the dataset, and the relevant variables are in the first four columns. You can also subset datasets by selecting columns by number with select() and filtering rows by number with slice().
gap_gdp %>% 
  select(1:4) %>% 
  slice(n = 5)
## # A tibble: 1 x 4
##   `Income per person (fixed 2000 US$)` `1960.0` `1961` `1962`
##   <chr>                                   <dbl>  <dbl>  <dbl>
## 1 Algeria                                 1280.  1085.   856.

Let’s say that we are interested in the earliest 10 years as well as the most recent 10 years that appear in the dataset. If we want to aggregate the data per year, they should ideally be in long format.

4

Rearrange the data so that they are in the long format.
Remember that the command for converting wide format data to long format is gather(). Additionally, you might want to create a more convenient column name for the variable Income per person (fixed 2000 US$) with rename() (GDP might be a good choice here) and change the variable type for year to integer.
gap_gdp %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP") %>% 
  mutate(year = as.integer(year))
## # A tibble: 14,300 x 3
##    country                year   GDP
##    <chr>                 <int> <dbl>
##  1 Abkhazia               1960   NA 
##  2 Afghanistan            1960   NA 
##  3 Akrotiri and Dhekelia  1960   NA 
##  4 Albania                1960   NA 
##  5 Algeria                1960 1280.
##  6 American Samoa         1960   NA 
##  7 Andorra                1960   NA 
##  8 Angola                 1960   NA 
##  9 Anguilla               1960   NA 
## 10 Antigua and Barbuda    1960   NA 
## # ... with 14,290 more rows

There are still a lot of missing values we might want to get rid of, and the data are not arranged in a way that is ideal to explore changes over time. For the next tasks, simply re-use the previous code and add the following commands with %>%.

5

Remove all missing values and arrange the data in ascending order of year and GDP.
There are several ways to exclude missing values. The most convenient one is to use filter() in combination with !is.na.
gap_gdp %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP") %>% 
  mutate(year = as.integer(year)) %>% 
  filter(!is.na(GDP)) %>% 
  arrange(year, GDP)
## # A tibble: 7,988 x 3
##    country       year   GDP
##    <chr>        <int> <dbl>
##  1 Malawi        1960  98.6
##  2 China         1960 105. 
##  3 Burundi       1960 116. 
##  4 Burkina Faso  1960 122. 
##  5 Lesotho       1960 132. 
##  6 Nepal         1960 139. 
##  7 Togo          1960 177. 
##  8 India         1960 181. 
##  9 Pakistan      1960 187. 
## 10 Indonesia     1960 201. 
## # ... with 7,978 more rows

Now we have a - more or less - clean dataset for our actual task: calculating the mean values across all countries for each of the first ten years and each of the last ten years. What’s still a little bit distracting is that we have the values for all years between these two periods in the data. However, we might want to use some of these data points in future analyses. Hence, we will do the following analyses ‘on the fly’ (i.e., without creating a new dataset). Let’s start with the first period.

6

Calculate the (grand) mean value of the GDP of all countries for each of the first ten years in the dataset.
As the year variable is an integer, you can simply filter the range of years you are interested in. The first year in the dataset is 1960.
gap_gdp %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP") %>% 
  mutate(year = as.integer(year)) %>% 
  filter(!is.na(GDP)) %>% 
  arrange(year, GDP) %>% 
  filter(between(year, 1960, 1969)) %>% 
  group_by(year) %>% 
  summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
##     year GDP_over_all_countries
##    <int>                  <dbl>
##  1  1960                  2863.
##  2  1961                  2936.
##  3  1962                  3034.
##  4  1963                  3127.
##  5  1964                  3297.
##  6  1965                  3479.
##  7  1966                  3562.
##  8  1967                  3644.
##  9  1968                  3827.
## 10  1969                  4015.

Now it should be easy do the same for the 10 most recent years in the dataset…

7

Calculate the (grand) mean value of the GDP of all countries for each of the last ten years.
The most recent year in the dataset is 2011.
gap_gdp %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP") %>% 
  mutate(year = as.integer(year)) %>% 
  filter(!is.na(GDP)) %>% 
  arrange(year, GDP) %>% 
  filter(year >= 2002) %>% 
  group_by(year) %>% 
  summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
##     year GDP_over_all_countries
##    <int>                  <dbl>
##  1  2002                  7983.
##  2  2003                  8113.
##  3  2004                  8335.
##  4  2005                  8545.
##  5  2006                  8899.
##  6  2007                  9219.
##  7  2008                  8999.
##  8  2009                  8463.
##  9  2010                  7700.
## 10  2011                  7603.